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Abstract 

Digital signal processing (DSP) techniques for biological sequence analysis continue to grow in popularity due to 
the inherent digital nature of these sequences. DSP methods have demonstrated early success for detection of 
coding regions in a gene. Recently, these methods are being used to establish DNA gene similarity. We present the 
inter-coefficient difference (ICD) transformation, a novel extension of the discrete Fourier transformation, which can 
be applied to any DNA sequence. The ICD method is a mathematical, alignment-free DNA comparison method that 
generates a genetic signature for any DNA sequence that is used to generate relative measures of similarity among 
DNA sequences. We demonstrate our method on a set of insulin genes obtained from an evolutionarily wide range 
of species, and on a set of avian influenza viral sequences, which represents a set of highly similar sequences. We 
compare phylogenetic trees generated using our technique against trees generated using traditional alignment 
techniques for similarity and demonstrate that the ICD method produces a highly accurate tree without requiring 
an alignment prior to establishing sequence similarity. 
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Introduction 

Substantial technological advances continue to be made in 
modern DNA sequencing instrumentation. Next-generation 
sequencing (NGS) systems generate genetic and genomic 
data at unprecedented rates. Methods that can be used to 
help us understand these data are being researched in earn- 
est. In general, the most common, biologically meaningful 
approach to understand new sequence data are based on 
methods that can compare new data against a large set of 
data that is well understood. 

When a new biological sequence with unknown func- 
tion has been identified, researchers search for the most 
similar' sequence in a database of annotated sequence 
data, under the premise that similar sequences imply simi- 
lar biological functionality, and in the case of proteins, 
similar structural characteristics. Similarity between two 
biological sequences forms the basis for determining 
whether the sequences are homologous, i.e., there is 
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shared ancestry between them [1]. Phylogenetics, the 
study of evolutionary relationships between organisms, re- 
lies on methods that can quantitatively measure differ- 
ences between these organisms, with the premise that 
larger differences between organisms imply a larger span 
of time before the organisms split from a common ances- 
tor. Phylogenies are most commonly inferred from pair- 
wise comparisons performed on the underlying genetic 
sequence data obtained from the organisms being ana- 
lyzed [2,3]. For these and many other reasons, sequence 
analysis methods are among the most researched and 
sought after methods in bioinformatics. We encourage the 
reader to consult a text on biological sequence analysis to 
learn about existing methods [1,4]. 

Generally speaking, the predominant methods for bio- 
logical sequence comparison are based on sequence align- 
ments, such as the popular BLAST and the ClustalW 
series of methods [5,6]. Alignment methods have repre- 
sented the de facto standard for sequence analysis, com- 
parison, and retrieval. However, the advent of NGS 
sequencing has pushed traditional alignment methods to 
their limits. There are numerous user-defined parameters 
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for dealing with gaps and mismatches between sequences, 
and it is difficult to determine the ideal parameters to 
achieve an optimal alignment. The computational re- 
sources required for these methods can increase quadrati- 
cally or more with respect to the length of the sequences 
and the number of sequences being aligned [1]. Moreover, 
there is an increased risk of errors being introduced with 
multiple sequence alignments as the average pairwise se- 
quence identity of the data being aligned decreases. An- 
other source of alignment error arises if the order of 
significant regions in sequences is not conserved [7]. If an 
optimal alignment has been found, it is difficult to deter- 
mine an accurate metric of distance between sequences 
[8]. Despite these challenges, alignment methods continue 
to be used. With appropriate parameter selection, they 
excel at visually indicating regions that are highly con- 
served among many sequences. 

To overcome these challenges, there has been in- 
creased interest in techniques that can compare se- 
quences without an alignment, referred to as alignment- 
free methods [7-9]. The most popular alignment-free 
methods are based on computing various transforma- 
tions of fixed-length words of length n (or w-mers, n- 
grams), with common approaches involving computing a 
frequency vector over all possible w-mers for each se- 
quence [2]. Other methods search for a shared set of the 
longest common subsequences [10]. These methods 
tend to be among the most efficient, as their computa- 
tional complexity is linear [9]. However, they may lose 
valuable information with respect to positioning of im- 
portant subsequences within the whole sequence. More- 
over, like alignment-based methods, they often require 
multiple runs to select the most ideal parameters. 

Digital signal processing (DSP) techniques have been 
used effectively for efficient searching and comparison of 
sequential data [11,12]. They are emerging as another al- 
ternative alignment-free approach used to analyze both 
genomic and proteomic data. In order for these data to 
be processed using DSP techniques, they must be con- 
verted to a numeric sequence. There are several numeric 
representations available, each with their own strengths 
and weaknesses [13-15]. In the case of DNA, there are a 
limited number of numeric transformations available. 
DNA encodes the genetic blueprint of every organism as 
a sequence over four possible nucleotides, represented 
as A, C, G, or T. Encoded in DNA are genes, which con- 
tain the instructions to make proteins, and intergenic re- 
gions, which fill in the large gaps between genes. Within 
each gene are coding regions (exons) and noncoding re- 
gions (introns). The information content, which is crit- 
ical to understanding the biological function of the gene, 
is hidden in the coding regions in the gene. Coding re- 
gions are comprised of codons, nucleotide triplets that 
code for individual amino acids, and represent a very 



small portion of the entire genome. In the human gen- 
ome, only about 5% of it contains coding instructions. 
These complexities make the process of identifying 
genes and coding regions within these genes a daunting 
task. 

Proteins have more choices of possible numeric trans- 
formations available, owed in part to the physicochemi- 
cal properties of amino acids. Proteins themselves are 
long polypeptide chains of amino acids. There are 20 
possible amino acids that exist in proteins, each having 
many physicochemical properties, such as hydropathy, 
charge, and solubility. These properties provide useful 
numeric representations for protein sequences, making a 
translation to a numeric sequence a relatively easy process. 
For example, the MAFFT method is a protein sequence 
alignment method that converts converted proteins into 
numeric sequences that represent the polarity and volume 
values of each amino acid residue in the proteins being 
aligned [16]. 

Regardless of the numeric transformation chosen, pres- 
ervation of information content in the sequence is critical. 
This is perhaps one reason for the most common repre- 
sentation of a DNA, the binary indicator sequence, also 
commonly known as the Voss representation [17]. In this 
representation, each DNA sequence is transformed into a 
sequence of binary occurrence vectors. (This is the trans- 
formation used in our research, and is described in detail 
in the Methods and materials section). Some methods use 
variations of the binary indicator sequence. For example, 
Afrexio et al. introduced a variant of the Voss representa- 
tion that converts the occurrence vector into a vector of 
inter-nucleotide distances [18]. There is a wide range of 
transformations available [13]. Hota et al. analyzed the 
performance of several common DNA to numerical map- 
ping techniques. They provide a good description of each 
transformation method used in practice [19]. 

DSP based methods have continued to emerge in recent 
years for the purpose of genomic analysis. The most preva- 
lent use has been to locate reading frames in DNA, as well 
as different regions in the genome, including genes and cod- 
ing (or exon) regions within these genes [14,20,21]. Sharma 
et al. analyzed the performance of several DNA mapping 
schemes for detecting the coding region of genes [15]. DSP 
techniques have been used to address other problems in 
genomics and proteomics. For example, methods have been 
developed for splice site detection within the gene [20], the 
identification of active sites in a protein using Morlet wave- 
lets [22] and identification of acceptor splicing sites and the 
visual identification of patterns and motifs in DNA through 
spectral analysis [14,23]. 

Regardless of the domain, the field of digital signal 
processing has provided a plethora of methods for ana- 
lyzing sequential data. Most methods use variations of 
the Fourier transform [24], with the discrete Fourier 
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transform (DFT) being among the most popular signal 
processing technique [25,26]. Typically, the fast Fourier 
transform (FFT) is used to compute the DFT, as it is 
among the most computationally efficient algorithms for 
this purpose [24]. These transforms have been success- 
fully used for general sequential data comparison and re- 
trieval [11,12], and are readily suitable for biological 
sequence comparison, owed to the inherent discrete, 
symbolic nature of biological sequences [14,27,28]. In 
fact, FFTs have been used to analyze DNA data before 
[20,29,30]. In addition to some of the methods listed 
previously, Cheever et al. measured the cross correlation 
of two DNA sequences to explore significant regions of 
similarity between the DNA, where the cross correlation 
was computed using a FFT [31]. The FFT has also been 
used for protein sequence alignments in the MAFFT 
method [16]. 

There have been many DSP-based methods introduced 
in recent years for biological data analysis; however, very 
few were designed to report a biologically relevant meas- 
ure of evolutionary distance between sequences being 
analyzed, particularly when a large number of sequences 
are being analyzed. Multiple sequence alignments have 
been used successfully for this purpose, but these methods 
can be computationally expensive and are prone to errors, 
particularly as the set of sequences being analyzed in- 
crease in size and diversity. We developed a novel signal 
processing technique that characterizes genetic sequence 
data through a simple transformation of the coefficients 
generated by the DFT of a specific numeric representation 
of the original DNA sequence. In our work, we compute a 
transformation on the set of coefficients generated that we 
call the inter-coefficient difference or ICD. We show that 
this characterization effectively produces a signature for a 
given sequence and can be used to compare genetic se- 
quences among different species. The ICD method pro- 
vides comparisons between genes from evolutionarily 
distant species, as well as subtle variants from identical 
genes from the same species. We demonstrate its ef- 
fectiveness through analysis of datasets that have dif- 
ferent levels of pairwise similarity. The method effectively 
generates a pairwise distance matrix representing the level 
of similarity between each genetic sequence with remark- 
able running times. The resulting matrix can be used to in- 
duce a dendrogram representing phylogenetic relationships 
between species from which the sequences were obtained. 
Our results show that we produce alignment-free dendro- 
grams that are highly similar to those trees produced using 
alignment-based techniques and other alignment-free 
methods. 

Methods and materials 

Our method is based on the application of the DFT to 
four numeric sequences that are derived from the original 



DNA sequence. We use a binary indicator sequence rep- 
resentation of a DNA sequence, which is among the most 
popular numeric representation used in this area in litera- 
ture [17,20]; it allows for an easy transformation from the 
original sequence on which many DSP and other numeric 
transformations can be computed [18,20,27]. 

The inter-coefficient difference 

Let S represent a set of DNA sequences, where s t repre- 
sents an arbitrary sequence in S. Each DNA sequence s t 
is defined over the alphabet. Let N be the length of the 
longest sequence in S. Each sequence s t in S goes 
through a series of transformations to produce the cor- 
responding ICD vector. The first transformation com- 
putes a unique binary indicator sequence from s t . Next, 
we apply the DFT on the indicator sequence, yielding a 
vector of coefficients. Basic mathematical transforma- 
tions are applied to the coefficient vector, resulting in 
the ICD vector. The details of this algorithm are given 
below. 

For a given sequence s u we define four binary indicator 
sequences x A [n], x c [n], x G [n], and x T \n\, which indicate 
the presence (i.e., a 1) or absence (i.e., a 0) of a symbol 
in Si at position n. Each indicator sequence is padded 
with zeros to ensure that every indicator sequence in S 
has an identical length of N. Zero padding is a common 
technique with FFT computations that can increase the 
spectral resolution and can increase the efficiency of the 
computation when the length of the original sequence is 
padded to a power of 2 [26]. For example, let s t = GAC 
GACTCAT, which has a length of 10. However, suppose 
that N, which is the length of the longest sequence in S, 
is 12. Then: 

Si = GACGACTCAT 

x A = 010010001000 

x c = 001001010000 

x G = 100100000000 

x T = 000000100100 

For each indicator sequence, we compute the DFT, 
which converts the finite-length sequence x A [n] into a 
series of coefficients X A [k] resulting from the DFT com- 
putation, defined in Equation 1: 

X A [k] = J2x A [n]e-'(^) k = 0, 1, ...,N-1 (1) 

The coefficients produced are complex, and thus the 
absolute value of each coefficient is computed, yielding a 
series of real valued numbers. X A [0] represents the num- 
ber of 1 s in the indicator sequence x A . It is discarded 
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because it is substantially larger than all other coeffi- 
cients and is highly dependent on the length of the 
original unpadded sequence. We retain coefficients 
X A [l],X A [2],...,X A [ N / 2 ], eliminating half of the coeffi- 
cients because of the symmetric nature of the coeffi- 
cients produced by the DFT [26]. The remaining 
coefficients are denoted as vector X^. We normalize 
by dividing by its Euclidean norm, ||X^||, resulting 
in X^. Equation 2 illustrates this transformation, intro- 
ducing variable rj for simplicity: 



(2) 



x; 



^ A -[\X A [l]\,\X A { 



X* 



For each vector X A , we compute the inter-coefficient 
difference of X A , denoted ICD(X A ), by computing the 
difference between each adjacent number in the se- 
quence as shown in Equation 3: 

ICD(X^) = [X A [2]-X A [1],X A [3]-X^[2],..,X A M-X A [^-1]] 

(3) 

The same computations are repeated for indicator se- 
quences Xc, x G , and x T , yielding vectors X c , X G , and X r 
separately. 

For example, continuing from our previous example 
indicator sequence, x A = 010010001000, and N = 12. We 
apply Equations 1 and 2 above on x A) which computes 
the DFT on x A and normalizes it, resulting in the vector 
of coefficients X^: 

X* = [0.5176, 1.0000, 2.2361, 1.7321, 1.9319, 1.0000] 

||X* || = 3.7417 

X A = [0.1383,0.2673,0.5976,0.4629,0.5163,0.2673] 

Then, the inter-coefficient difference of X A is com- 
puted, resulting in: 

ICD(Xa) = [0.1289,0.3304, -0.1347,0.0534, -0.2490] 

The ICD of each coefficient vector resulting from vec- 
tors X c , X G , and X r is concatenated to produce a single 
numeric vector, denoted X. 

X = [ICD(X A )ICD(X c )ICD(X G )ICD(X r )] 

It is important to mention that all ICD vectors will have 
an equal length for every sequence in S, regardless of the 
length of the original sequence. Each indicator sequence 
transformation is padded to have a length of N, which is 
the length of the longest sequence in S. The final 
concatenated vector X will have a length of \ N / 2 \ = 4^. 



Establishing distance between DNA sequences 

Given two arbitrary DNA sequences, Si and s 2 in set S, 
we can compute the ICD transformation yielding nu- 
meric vectors X 2 and X 2 , respectively. A single numeric 
value that represents a measure of biological distance is 
computed from these vectors by computing the correl- 
ation between the two vectors. We compute Dist (X!, 
X 2 ), a single measure of distance between the ICD vec- 
tors, as follows: 



4/7 



Dist(X!,X 2 ) = 1.0- 



4/7 4/7 

J2(x 1 [i\-x 1 fJ2(x 2 [i\-x 2 f 

\ i=l i=l 

(4) 



Equation 4 is 1.0 minus a standard correlation calcula- 
tion between two sets of data. We know that a standard 
correlation falls in the range [-1.0, 1.0], where -1.0 is a 
perfect negative correlation and 1.0 is a positive correl- 
ation. Two vectors of identical values would have perfect 
positive correlation, and thus their Dist calculation 
would be 0.0, implying that there is no distance between 
them. A value of 2.0 is perfect negative correlation, im- 
plying opposing numerical trends around the means. 

Data 

To test the efficacy of this method, we assembled two 
sets of DNA data. Our first set consisted of mRNA insu- 
lin sequences from 19 different animals, called INS19 
(Table 1). Insulin is an important hormone found through- 
out the animal kingdom for regulating carbohydrate and fat 
metabolism and for managing glucose levels in the blood. 
All sequences were downloaded from NCBI's RefSeq data- 
base (http://www.ncbi.nlm.nih.gov/refseq/). This dataset 
was chosen to measure the ability of the method to assess 
pairwise similarity over a set of sequences that have highly 
conserved regions in its genetic sequence owed to its similar 
function among all species while exhibiting substantial re- 
gions of low conservation in proportion to the evolutionary 
distance between species. The length of the sequences in the 
data ranged between 291 and 774 nucleotides in length. 

Our second set of data was chosen to test the ability 
of the method to accurately distinguish subtle differ- 
ences among a large set of sequences from the same 
gene obtained from the same viral species. To this end, 
we selected 60 influenza type A sequences collected 
from the NCBI Influenza Virus Sequence Database 
(http://www.ncbi.nlm.nih.gov/genomes/FLU/). Influenza is 
an RNA virus that affects a wide range of mammals and 
birds; in extreme cases, it can lead to death. Influenza A vi- 
ruses are broken down into different subtypes that are 
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Table 1 mRNA insulin sequences from 19 animal species 
in the INS19 dataset 

Species Common name Accession Length 

H. sapiens Human NM_000207 469 

P. troglodytes Chimp NM_001 008996 416 

0. baboon Olive baboon XM_003909376 505 

M. fascicularis Monkey J00336 392 

B. taurus Cow NM_1 73926 434 
S.scrofa Pig NM_001 109772 435 
G.gallus Chicken NM_205222 453 
Cfamiliaris Dog NM_001 130093 463 
F. catus Cat AB043535 420 
C procellus Guinea pig NM_001 172891 442 

C. cristata Star-nosed mole XM_004695041 291 

E. telfairi Hedgehog XM_004717178 327 
M. auratus Hamster XM_005064148 450 
O.cuniculus Rabbit NM_001 082335 433 

D. rerio Zebrafish AF036326 468 
P.buchholzi Butterfly fish AF1 99588 459 
Cchitala Clown knifefish AF1 99586 375 

F. albicollis Flycatcher XM_005 046804 324 
X.laevis Clawed frog NM_001 085882 774 



named based on two specific proteins that are on the sur- 
face of the virus: hemagglutinin (HA) and neuraminidase 
(NA). There are 17 types of the HA protein and 10 types 
of neuraminidase NA protein. Each virus receives a desig- 
nation labeled HxNy, where x represents a specific subtype 
of the HA gene and y represents a subtype of the NA gene 
in the virus. Our dataset, denoted FLU60, contains 60 ex- 
amples of avian influenza sequences (influenza sequences 
known to affect birds) for the HA gene only, collected from 
various locations in the United States between January and 
July of 2010. Avian flu strands were selected because all 
known subtypes of influenza A can affect birds. The length 
of all sequences in FLU60 ranged between 1,683 and 1,746 
nucleotides in length. The frequency of influenza A sub- 
types in the dataset are detailed in Table 2. The most dom- 
inant variant in the data is H4N6 at 25 examples, with 
H3N% variants coming in second. Because we collected 
only examples of the HA gene, only the Wx part of the sub- 
type name should play a role in determining similarity. 
Additional file 1: Table SI has detailed information about 
the dataset, including the accession number, subtype, date 
and place that specimen was acquired, and the length of 
each sequence [see Additional file 1]. 

Results 

To assess the capability of the ICD method to measure 
sequence similarity, we generated a dendrogram based 



Table 2 Avian influenza A subtype frequency in FLU60 



Influenza A subtype Frequency 



H1N1 


3 


H1N3 


1 


H3N1 


1 


H3N6 


1 


H3N8 


13 


H4N6 


25 


H6N1 


2 


H7N3 


6 


H9N2 


1 


H10N7 


4 


H11N9 


2 


H12N5 


1 



on a hierarchical clustering using the unweighted pair 
group method average (UPGMA) method for construct- 
ing the tree. This was performed for both INS 19 and 
FLU60 datasets. For comparative purposes, we com- 
puted an all-against-all pairwise global alignment using 
the standard Needleman-Wunsch algorithm for each set 
of sequences being tested [32], utilizing a uniform nu- 
cleotide substitution matrix (as defined by the nuc44 
function in the Matlab® Bioinformatics Toolbox) for the 
purpose of finding the best alignment. Though comput- 
ing a pairwise alignment for all possible pairs of se- 
quences is computationally expensive, this will yield a 
superior alignment than any single multiple sequence 
alignment (MSA), as it significantly reduces the likeli- 
hood of introducing alignment errors that result from an 
MSA. The distance between each pair of aligned se- 
quences was computed by measuring the proportion of 
sites in the alignment at which the two sequences are 
different, yielding a score of 1 for entirely dissimilar se- 
quences and 0 if they were identical. This distance meas- 
ure yields identical groupings to those that are generated 
directly from the alignment score itself but has a com- 
parative advantage of producing numbers that are in an 
identical range to the distance values that are produced 
with the ICD method. ICD uses a correlation coefficient 
between coefficient differences and likewise always pro- 
duces a distance value between 0 and 1. We also compared 
our results to an alignment-free sequence comparison 
method called feature frequency profile (FFP), which is a 
popular tool for phylogenetic analysis [2]. We used default 
parameters on all FFP tools to generate a tree, with the ex- 
ception of word size; we evaluated word sizes between 6 
and 20 and determined that a word size of 16 achieved re- 
sults that produced the most biologically correct phylogen- 
etic groupings. Finally, the Robinson-Foulds (RFdist) tree 
distance metric is computed on the INS 19 test using the 
treedist function in the phangorn package in R [33,34]. 



H. sapiens 


Human 


NM_000207 


P. troglodytes 


Chimp 


NM_001 008996 


0. baboon 


Olive baboon 


XM_003909376 


M. fascicularis 


Monkey 


J00336 


B. taurus 


Cow 


NM_1 73926 


S. scrota 


Pig 


NM_001 109772 


G. gallus 


Chicken 


NM_205222 


C. familiaris 


Dog 


NM_001 130093 


F. catus 


Cat 


AB043535 


C. procellus 


Guinea pig 


NM_001 172891 


C. cristata 


Star-nosed mole 


XM_004695041 


E. telfairi 


Hedgehog 


XM_004717178 


M. auratus 


Hamster 


XM_005064148 


0. cuniculus 


Rabbit 


NM_001 082335 


D. rerio 


Zebrafish 


AF036326 


P. buchholzi 


Butterfly fish 


AF1 99588 


C. chitala 


Clown knifefish 


AF1 99586 


F albicollis 


Flycatcher 


XM_005046804 


X. laevis 


Clawed frog 


NM_001 085882 



King ef a I. EURASIP Journal on Bioinformatics and Systems Biology 2014, 2014:8 
http://bsb.eurasipjournals.com/content/201 4/1 /8 



Page 6 of 12 



RFdist is computed between all combinations of pairs of 
trees to assist in measuring tree similarity. 

ICD method on INS19 dataset 

Our first test was conducted to measure the ability for 
the ICD method to accurately assess similarity between 
sequences that are relatively divergent, where the data 
was collected from a wide range of eukaryotic species. 
The INS 19 dataset contains data from the insulin gene, 
taken from 19 species in the eukaryotic kingdom. The 
range of pairwise sequence identity after alignment ranged 
between 32% and 89% identity, with an average observed 
percent identity at 60% (see Figure 1). A dendrogram was 
built based on the pairwise similarity computed from the 
ICD method and is shown in Figure 2. For comparison 
purposes, an all-against-all pairwise global alignment 
(denoted AAP) was performed on all sequences, and a 
dendrogram was built revealing the relationships between 
the sequences based on the alignment. A dendrogram was 
also computed based on the alignment-free FFP method [2]. 
The resulting dendrograms from each of these comparative 
methods are shown in Figures 3 and 4, respectively. 

All trees exhibit strong similarities within major group- 
ings, closely resembling phylogenetic relationships observed 
in nature, with some subtle, yet biologically significant dif- 
ferences between each method. In particular, both ICD and 
AAP methods place monkey and chimp as the most similar 
among all species, whereas FFP places human and chimp 
as most similar. All methods suggest the African clawed 
frog as most distant from others species used in this study. 
The FFP method grouped the zebrafish with the clawed 
frog, whereas the ICD and AAP methods correctly cluster 
all three fish species. The AAP method grouped a 



Pairwise aligned sequence identity of INS19 

25 | , , , , , , , , 1 




100 



% identity 



Figure 1 Histogram of observed sequence identity over all 

pairs of aligned sequences in INS19 dataset. The percent identity 

is computed for all possible pairs of sequences in the INS19 dataset. 

Most data averaged between 55% and 75% sequence identity. 
\ ) 



hedgehog, a type of rodent, with a flycatcher, a type of 
bird. In contrast, the ICD and FFP methods correctly 
grouped the flycatcher with a chicken, which are both 
types of birds, and the hedgehog with other similar 
mammals. The AAP method grouped the hamster, a 
rodent, with the cow and pig, which are both even-toed 
ungulates; the FFP method fared a bit better, placing a 
hamster between a rabbit and hedgehog. In contrast, the 
ICD method correctly grouped the hamster with the 
guinea pig, which are both rodents. 

The RFdist distance metric was computed between all 
pairs of trees. The RFdist between the ICD and FFP 
phylogenetic trees is 26, between ICD and AAP is 24, 
and between FFP and AAP is 22. These values suggest 
that, though the trees have similar groups, they have a 
relatively equal number of different partitions of data 
that are implied by each tree, with the final tree pro- 
duced by the ICD method being only slightly more simi- 
lar to the tree produced by the all-against-all pairwise 
alignment than the FFP method. 

ICD method on FLU60 dataset 

Our next test was conducted on the FLU60 dataset, 
which contains 60 DNA sequences of the HA gene from 
avian influenza A virus. Conducting an all-against-all 
pairwise alignment revealed a pairwise sequence identity 
range of 57% to 99.9%, with an average identity of 70.5%. 
Additional file 1: Figure SI shows a histogram revealing 
the sequence identity over all pairs of sequences (see 
Additional file 1). We performed identical analyses on 
these data to the analyses performed with the INS 19 
data, resulting in dendrograms from each method. The 
dendrogram for the ICD method is shown in Figure 5. 
The dendrograms for the AAP and FFP methods are 
shown in Additional file 1: Figures S2 and S3 (see 
Additional file 1). The RFdist metric was not measured 
for this test. 

Close evaluation of these dendrograms will reveal re- 
markably similar groupings among each individual sub- 
type of influenza A. We were pleased to see that all 
influenza HA subtypes were grouped together correctly 
by all methods. In particular, in the case of H3 and H4 
subtypes, all three methods indicated two very distinct 
strains. H3 is divided into a strain that hit Mississippi 
and one that hit Alaska. H4 was divided into three dis- 
tinct strains, with all methods agreeing on the divisions. 
When looking at similarity between subtypes, all methods 
group together influenza A subtype H7 with H10, suggest- 
ing that each of these groups share a common ancestor. 
However, they differ slightly on the ancestry relationships 
between H9, Hll, and H12. These findings, as well as 
most of the other relationships observed in this study, are 
confirmed by Air's work on sequence relationships in the 
hemagluttinin genes of 12 different variants of influenza A 
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Figure 2 ICD-based dendrogram for INS19. This figure shows the resulting dendrogram generated based on the ICD method applied on the 
ICD19 dataset, which contains mRNA sequences taken from 19 different eukaryotic species for the insulin (INS) gene. 



[35]. The methods differ on the divergence point of sub- 
type H6; the AAP and FFP methods suggest that H6 and 
HI have a common ancestor, whereas the ICD method 
suggests that H6 diverged much earlier from a subgroup 
consisting of H4, H9, and H12. The AAP and FFP method 
are closer to the similarity observed in Air's work. How- 
ever, the level of similarity computed by the ICD method 



between H6 and subgroup H4, H9, and H12 is remarkably 
similar to the alternative group HI, Hll, and H3, suggest- 
ing that the common ancestor could have been from ei- 
ther group. 

The execution times were recorded for each of the 
methods we investigated. In addition, we included the 
timing results of ClustalW2 [36] and Clustal Omega 



Figure 3 Alignment-based dendrogram for INS19. This figure shows the resulting dendrogram generated from phylogenetic relationships 
inferred from pairwise alignments computed over all pairs from the INS19 dataset, which contains mRNA sequences taken from 19 different 
eukaryotic species for the insulin (INS) gene. 
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Figure 4 Alignment-free-based dendrogram using FFP [2] method for INS19. This figure shows the resulting dendrogram generated from 
phylogenetic relationships inferred using the FFP method on the INS19 dataset, which contains mRNA sequences taken from 19 different 
eukaryotic species for the insulin (INS) gene. 



[37], which are two other popular multiple sequence 
alignment methods widely used today. All methods were 
run on a laptop computer running Mac OS X 10.9 with 
a 2.2 GHz Intel Core i7 processor equipped with 16 GB 
of memory. The ICD and AAP methods were run as 
Matlab/ applications, while FFP, ClustalW2, and Clustal 
Omega were compiled and installed as native applica- 
tions. All methods were executed multiple times with 
similar loads. The first time was discarded to eliminate bias 
resulting from file system latency. All methods ran under 
one second of execution time on the INS19 dataset. This is 
not surprising given the small size of the dataset and the 
short length of each sequence. The FLU60 dataset provided 
a much more informative comparison. Table 3 shows the 
results for all five methods tested. The results clearly indi- 
cate the strength of alignment-free methods with respect to 
running times. Among the alignment-free methods, the 
ICD method outperformed FFP, despite the fact that it is 
running within the Matlab® framework. This suggests that 
even better execution times may be observed with the ICD 
method if it was redesigned as a native application. 

Discussion 

The binary indicator representation of DNA is a common 
representation to use on methods that treat DNA as a 
digital signal [19]. Some suggest that this representation is 
common because it inherently retains the important three- 
base periodicity which is important for detecting coding re- 
gions in DNA [15]. Some methods make interesting 
transformations to the original indicator sequence, such 



as the inter-nucleotide distance utilized by Afreixo et al. 
[18], with the goal of strengthening signals in the original 
digital signal that are discriminatory between different se- 
quences. We applied the DFT on the indicator sequence. 
The DFT is a common DSP technique on digital signals 
and has been used in other methods for DNA sequence 
analysis. Each coefficient of the DFT represents a cross 
correlation of the entire input sequence and a complex si- 
nusoid at a specific frequency, notably kIN [25]. As noted, 
DFTs have been successfully used in detecting coding re- 
gions of genes, where a strong peak is observed at fre- 
quency N/3. Our work was largely motivated by an 
interest in investigating how the differences of the magni- 
tudes of the sinusoids between adjacent frequencies might 
improve sequence characterization in DNA. The ICD 
method presents a novel use of the DFT by computing 
the inter- coefficient difference from the resulting set of 
coefficients computed by the DFT transformation. The 
analysis presented here demonstrates its potential as a vi- 
able alternative approach toward DNA sequence analysis. 
In particular, the ability to distinguish differences between 
sequences having both low and high measures of hom- 
ology, without computing an alignment, is particularly 
useful, compared to the challenges from computing mul- 
tiple sequence alignments over large amounts of biological 
sequence data. 

One may wonder about the likelihood of two different 
sequences producing an identical ICD vector. The crit- 
ical part of the ICD method is the DFT. Two different 
DNA sequences will produce a unique set of coefficients, 
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H1N1 A/mallard/Mississippi/41 3/2010 2010/01/17 4 (HA) 

H1N1 A/red-necked grebe/Minnesota/A1 10-1 948/2010 2010/07/07 4 (HA) 

H1N3 A/northern shoveler/Mississippi/397/2010 2010/01/17 4 (HA) 

H1N1 A/mallard/Mississippi/442/2010 2010/01/19 4 (HA) 

H1 1N9 A/American green-winged teal/Mississippi/300/2010 2010/01/16 4 (HA) 

H11N9 A/American green-winged teal/Mississippi/383/2010 2010/01/16 4 (HA) 

H3N6 A/American green-winged teal/Interior Alaska/1 0BM04207R0/2010 2010/07/13 4 (HA) 

H3N8 A/northern shoveler/lnterior Alaska/1 0BM05649R0/2010 2010/07/26 4 (HA) 

H3N8 A/northern pintail/Interior Alaska/1 0BM06872R0/2010 2010/07/31 4 (HA) 

H3N8 A/northern pintail/Interior Alaska/1 0BM06704R0/2010 2010/07/30 4 (HA) 

H3N8 A/northern shoveler/lnterior Alaska/1 0BM05382R0/2010 2010/07/25 4 (HA) 

H3N8 A/mallard/lnterior Alaska/1 0BM05970R0/2010 2010/07/27 4 (HA) 

H3N8 A/American green-winged teal/Interior Alaska/1 0BM05376R0/2010 2010/07/25 4 (HA) 

H3N8 A/mallard/lnterior Alaska/1 0BM05797R0/2010 2010/07/26 4 (HA) 

H3N8 A/mallard/lnterior Alaska/1 0BM06828R0/2010 2010/07/31 4 (HA) 

H3N8 A/northern pintail/Interior Alaska/1 0BM00303R0/2010 2010/05/14 4 (HA) 

H3N1 A/northern pintail/Interior Alaska/1 0BM00849R2/2010 2010/05/23 4 (HA) 

H3N8 A/mallard/Mississippi/354/2010 2010/01/16 4 (HA) 

H3N8 A/mallard/Mississippi/386/2010 2010/01/17 4 (HA) 

H3N8 A/mallard/Mississippi/360/2010 2010/01/16 4 (HA) 

H3N8 A/American green-winged teal/Mississippi/285/2010 2010/01/14 4 (HA) 

H6N1 A/Canada goose/Delaware Bay/34/2010 2010/05/19 4 (HA) 

H6N1 A/shorebird/Delaware Bay/380/2010 2010/05/20 4 (HA) 

H12N5 A/mallard/lnterior Alaska/1 0BM021 1 1 R0/2010 2010/06/10 4 (HA) 

H9N2 A/mallard/lnterior Alaska/1 0BM02980R0/2010 2010/07/02 4 (HA) 

H4N6 A/mallard/California/1 154/2010 2010/07/27 4 (HA) 

H4N6 A/mallard/California/1 188/2010 2010/07/29 4 (HA) 

H4N6 A/mallard/California/1 289/2010 2010/07/30 4 (HA) 

H4N6 A/mallard/California/1 21 0/2010 2010/07/29 4 (HA) 

H4N6 A/mallard/California/1 297/2010 2010/07/30 4 (HA) 

H4N6 A/mallard/California/1 156/2010 2010/07/27 4 (HA) 

H4N6 A/American green-winged teal/Interior Alaska/1 0BM06728R0/2010 2010/07/30 4 (HA) 

H4N6 A/mallard/lnterior Alaska/1 0BM04618R0/2010 2010/07/18 4 (HA) 

H4N6 A/mallard/lnterior Alaska/1 0BM05337R0/2010 2010/07/24 4 (HA) 

H4N6 A/northern pintail/Interior Alaska/1 0BM04704R0/2010 2010/07/18 4 (HA) 

H4N6 A/ring-necked duck/Interior Alaska/1 0BM05617R0/2010 2010/07/26 4 (HA) 

H4N6 A/mallard/lnterior Alaska/1 0BM02530R0/2010 2010/06/22 4 (HA) 

H4N6 A/northern pintail/Interior Alaska/1 0BM02585R0/2010 2010/06/22 4 (HA) 

H4N6 A/northern pintail/Interior Alaska/1 0BM051 71 RO/2010 2010/07/22 4 (HA) 

H4N6 A/northern shoveler/lnterior Alaska/1 0BM02593R0/2010 2010/06/23 4 (HA) 

H4N6 A/mallard/lnterior Alaska/1 0BM04626R0/2010 2010/07/18 4 (HA) 

H4N6 A/American green-winged teal/Interior Alaska/1 0BM07000R0/2010 2010/07/31 4 (HA) 

H4N6 A/American green-winged teal/Interior Alaska/1 0BM05165R0/2010 2010/07/22 4 (HA) 

H4N6 A/northern shoveler/lnterior Alaska/1 0BM02892R0/2010 2010/06/30 4 (HA) 

H4N6 A/mallard/lnterior Alaska/1 0BM05247R0/2010 2010/07/23 4 (HA) 

H4N6 A/mallard/lnterior Alaska/1 0BM02644R0/2010 2010/06/25 4 (HA) 

H4N6 A/mallard/lnterior Alaska/1 0BM03979R0/2010 2010/07/12 4 (HA) 

H4N6 A/mallard/lnterior Alaska/1 0BM06448R0/2010 2010/07/30 4 (HA) 

H4N6 A/mallard/lnterior Alaska/1 0BM06456R0/2010 2010/07/30 4 (HA) 

H4N6 A/northern pintail/Interior Alaska/1 0BM02791 RO/2010 2010/06/25 4 (HA) 

H10N7 A/mallard/Mississippi/329/2010 2010/01/16 4 (HA) 

H10N7 A/northern shoveler/Mississippi/252/2010 2010/01/10 4 (HA) 

H 1 0N7 A/mallard/lnterior Alaska/1 0BM01 929R0/201 0 201 0/06/1 5 4 (HA) 

H10N7 A/northern shoveler/Mississippi/236/2010 2010/01/12 4 (HA) 

H7N3 A/northern pintail/Interior Alaska/1 0BM06306R0/2010 2010/07/29 4 (HA) 

H7N3 A/northern pintail/Interior Alaska/1 0BM06524R0/2010 2010/07/30 4 (HA) 

H7N3 A/mallard/lnterior Alaska/1 0BM05347R0/2010 2010/07/25 4 (HA) 

H7N3 A/northern pintail/Interior Alaska/1 0BM06303R0/2010 2010/07/29 4 (HA) 

H7N3 A/mallard/lnterior Alaska/1 0BM04564R1/2010 2010/07/18 4 (HA) 

H7N3 A/northern pintail/Interior Alaska/1 0BM02539R0/2010 2010/06/22 4 (HA) 



Similarity 

Figure 5 ICD-based dendrogram for FLU60. This figure shows the resulting dendrogram generated based on the ICD method applied on the 
FLU60 dataset, which contains 60 sequences of the HA gene of different subtypes of avian influenza type A. 



and likewise, our ICD transformation applied to these co- 
efficients is thus also unique, except under one condition: 
when one sequence is a rotational shift of the other, and 
both of these sequences represent the longest sequences 
in the set of DNA being analyzed, meaning, there will be 

Table 3 Observed execution time for FLU60 

Method Exec time (sec) 

ClustalW 157.0 

Clustal Omega 27 0 

Pairwise alignment 53.8 

FFP 7.1 
ICD 0.2 



no zeros appended to either sequence. For example, if the 
sequences GACGACTCAT and TGACGACTCA (the sec- 
ond sequence is equivalent to the first sequence right- 
shifted by 1 with rotation) were both in the set of DNA 
being analyzed and were the longest sequences in the data, 
they will both yield the same X^, X^, X^, and X^. vectors. 
However, the likelihood of two biologically meaningful 
DNA sequences being an entire rotational shift of the 
other is highly unlikely, particularly when analyzing entire 
genes. If this event were to actually occur in nature, then 
our method will yield these sequences as being identical, 
rightfully drawing the attention of the researcher. 

Though we tested several datasets to determine the effi- 
cacy of the method, it does have a limitation worth noting. 
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The ICD method works well when assessing similarity of 
identical genes over many different species, such as the 
INS 19 dataset It also works well when assessing similarity 
over many variants of the same gene from the same spe- 
cies, such as the FLU60 dataset. However, for evaluating 
the similarity over large, genomic regions or entire ge- 
nomes from different species, the ICD method is limited. 
The reason for this is due to the requirement of padding 
zeros to the indicator sequences to ensure all sequences 
have equal length. Sequences that are significantly shorter 
than the longest sequence will likewise have a substantial 
vector of zeros padded and thus will yield comparatively 
poor ICD vectors. 

The results of the ICD method compared favorably to 
other methods tested. In fact, we observed examples in 
the INS 19 test where the ICD analysis yielded a more 
phylogenetically correct tree than those produced from 
other methods tests, backed up by simple phylogenetic 
relationships observed in any biological text. We opted 
to perform an all- against- all pairwise alignment over a 
multiple sequence alignment to ensure the highest de- 
gree of accuracy of the measure of similarity of align- 
ments. In the FLU60 data, the ICD methods ability to 
detect the correct measure of similarity among even 
those sequences that had a high measure of pairwise se- 
quence identity was remarkable. 

A significant disadvantage of alignment-based sequence 
comparison methods is that they assume that important 
regions in the genetic sequence will follow the same order 
between similar species. However, as noted by Pinello 
et al, this is not always the case [7]. As a method based 
on the DFT, the ICD method capitalizes on recurrent pat- 
terns, regardless of the position of those patterns in the 
whole sequence. It is robust to possible reordering of con- 
served regions between genetic sequences. 

The ICD method offers a significant advantage over 
alignment and alignment-free methods by eliminating 
the need for parameters. Other methods often require 
multiple runs to determine the best parameter set. In 
comparison, our ICD method is a pure mathematical, 
alignment-free transformation that requires no user- 
defined parameters prior to the analysis. 

Depending on the alignment algorithm chosen, the run- 
ning time to compare m sequences of length n and pro- 
duce a tree based on alignment methods can vary between 
0(m 2 n 2 ) for ClustalW [6] to as high as 0(n m ) for dynamic 
programming approaches. More recently, Clustal Omega 
implemented substantial improvements over its predeces- 
sors in the Clustal family, improving the running time to 
0(nm log m), making it suitable for large-scale multiple 
sequence alignments. Alignment-free methods often have 
a performance advantage, particularly those that are based 
on /c-mer frequencies. These methods can be run in O 
(knm) time, noting that selection of word size will have an 



effect on the final performance. This is particularly im- 
portant for DNA, which requires longer word lengths for 
meaningful results. In contrast, the FFT runs in 0(mn 
log ri), suggesting that it is an efficient technique, com- 
parable with other alignment-free methods. Our re- 
sults in Table 3 confirm the theoretical running times, 
with the alignment-free methods having a superior ad- 
vantage over the alignment methods. 

Alignment-based methods have their advantages. In 
particular, an alignment will often produce a better abso- 
lute value of evolutionary distance between sequences by 
incorporating a substitution matrix such as BLOSUM62. 
In contrast, it is relatively difficult to infer a precise 
measure of evolutionary distance from alignment-free 
methods, and this is particularly true of the correlation 
computed from the ICD vector. This is not uncommon, 
as this is a limitation with any phylogenetic approach 
that involves computing a distance matrix based on se- 
quence homology. Despite this limitation, most of the 
relative distances observed between species in the INS 19 
dataset and between different variants of avian flu in the 
FLU60 dataset were consistent with the alignments pro- 
duced. More interestingly, we demonstrated a few dif- 
ferences between the results from the methods applied 
to the INS 19 data, where the ICD approach produced 
evolutionary relationships that were more consistent 
with our biological understanding of evolution among 
species that the other approaches we evaluated failed 
to capture. 

The use of a correlation coefficient for distance is part 
of the novel approach in this work. Even though the the- 
oretical value of the Dist computation is [0.0, 2.0], all 
pairs of sequences analyzed had values between 0.0 and 
1.0. In other words, sequences were either found to have 
a strong positive correlation, which is implied for Dist 
values near 0, or no correlation, for Dist values near 1.0. 
Our observations on all tests never observed Dist com- 
putations of more than 1.0. If this had happened, it 
would have implied that the two biological sequences 
being tested had a negative correlation with respect to 
their ICD vector. From a biological viewpoint, different 
species, genes, or even different variants within the 
same genes arise due to evolution; more specifically, 
due to selective pressures placed on the genome to be- 
come more 'fit' than its ancestors. The processes be- 
hind natural selection that are so important for breeding 
new species and genetic functions are not random. 
However, the underlying genetic mutations that occur 
over eons are generally considered to be random 
events [38]. The fact that we never observed a negative 
correlation might offer a metric to numerically confirm 
the random nature of evolution. This needs further in- 
vestigation over a much larger set of genetic data to 
draw any conclusions. 
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Conclusions 

In this paper, we present a novel use of the discrete Fou- 
rier transform to establish sequence similarity through in- 
corporating a simple transform of the coefficient vector. 
We demonstrated its efficacy on two datasets designed to 
measure the method's capability on establishing similarity 
among datasets with different levels of sequence identity. 
The ICD approach produced a high quality dendrogram 
representing phylogenetic relationships of sequences with 
different levels of sequence identity. Our results were 
nearly identical with those obtained using traditional 
alignment-based approaches. 

Additional file 



Additional file 1: Application of the discrete Fourier transform on 
DNA for sequence similarity. Table SI. Avian Flu Sequences (FLU60). 

Figure SI. Histogram of % identity in FLU60. Figure S2. Alignment 
based dendogram for FLU60. Figure S3. FFP based dendogram for FLU60. 
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AAP: all-against-all pairwise global alignment; DFT: discrete Fourier transform; 
DNA: deoxyribonucleic acid; DSP: digital signal processing; FFP: feature 
frequency profile method; FFT: fast Fourier transform; FLU60: dataset of 60 
variants of avian influenza; HA: hemagglutinin (an influenza gene); 
ICD: inter-coefficient difference; INS19: dataset of the insulin gene from 19 
species; MSA: multiple sequence alignment; NA: neuraminidase (an influenza 
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